This workshop focuses on tools for spatial computation in R. Topics covered are:

Spatial computation refers to data wrangling with spatial data. It enables the construction of new datasets and measures by joining otherwise un-linked data across space or measuring the space between objects.

Useful references:

Note that this workshop does not focus on spatial statistics. Spatial statistics refers to a set of statistical models that account for spatial autocorrelation. Many analyses that require spatial statistics involve very little spatial computation, and many analyses that rely on spatial computation do not require any spatial statistics.

This article is a good introduction to spatial statistics, if people want to learn more about it. Also, in a previous iteration of this workshop, my friend and colleague Chris Hess did a second half of this workshop about spatial statistics, which you can find here.

Examples

1. Ang & Tebes, 2021. Civic Responses to Police Violence

Research Question: Do police shootings affect election turnout in nearby neighborhoods?

Data:

  • Voting data from California statewide voter registration database
  • Police shootings from LA Times Homicide Database

Methods:

  • Geocode addresses from both data sources
  • Assign each voter and shooting to a Census block (spatial join)
  • Create a block-level panel with voting behavior and distance from shooting (distance)

Results

Police shootings lead to increased turnout in subsequent elections, but only in the same Census block where the shooting occurs.

2. Brown and Enos, 2022. The Measurement of Partisan Sorting for 180 Million Voters

Research Question: To what extent is America geographically segregated by partisanship?

Data: National voter registration database with 180 million registered voters

Methods:

  • Geocode each voter’s address
  • Find each voter’s 1000 nearest neighbors (distance)
  • Compute spatial exposure and isolation based on these neighbors

Results:

Extremely granular measure of partisan segregation across the country.
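The exposure idea in the methods above can be sketched in base R. This is a toy illustration with made-up points, plain Euclidean distance, and k = 10 rather than the paper's 1000; the variable names are hypothetical, not Brown and Enos's actual implementation.

```r
# Toy sketch of a k-nearest-neighbor exposure measure (made-up data)
set.seed(1)
n <- 200
voters <- data.frame(
  x = runif(n), y = runif(n),
  party = sample(c("D", "R"), n, replace = TRUE)
)
k <- 10  # the paper uses 1000 neighbors; 10 keeps the toy example small

# For each voter, find the k nearest other voters and compute the share
# of those neighbors who are Democrats (that voter's "exposure" to Democrats)
dists <- as.matrix(dist(voters[, c("x", "y")]))
diag(dists) <- Inf  # exclude each voter from their own neighbor set
exposure_to_D <- apply(dists, 1, function(d) {
  nbrs <- order(d)[1:k]
  mean(voters$party[nbrs] == "D")
})
summary(exposure_to_D)  # values near 0 or 1 signal local partisan isolation
```

Averaging these voter-level exposures within a group gives the segregation-style summary measures the paper reports.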

3. Decter-Frain, Sachdeva, Collingwood, Burke, Murayama, Barreto, Wood, Henderson, & Zingher. (2022). Comparing Methods for Estimating Demographics in Racially Polarized Voting Analyses

Research Question: How should we measure voter turnout by race/ethnicity at the precinct level?

Data: Voter registration records following the 2018 Georgia Gubernatorial election

Methods: Compare two spatial computation approaches

  1. Spatial interpolation from Census data
  • “Convert” block-level racial composition to precinct geographic boundaries (interpolation)
  2. Bayesian Improved Surname Geocoding (BISG)
  • Geocode voter addresses
  • Assign voters to blocks (spatial join)
  • Using their block and surname, get a probability of race/ethnicity for each voter
  • Aggregate probabilities to the precinct level

Results:

BISG is more accurate as long as precincts are sufficiently homogeneous. In more diverse precincts, bias from BISG gets worse than the noise from interpolation.
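The BISG step is, at its core, a Bayes update. A minimal sketch with invented probabilities (not real Census values): start from the racial distribution associated with a surname, then reweight by how likely members of each group are to live in the voter's block. This is one common formulation of BISG, not necessarily the exact one used in the paper.

```r
# Toy BISG update (illustrative numbers only):
# P(race | surname, block) is proportional to P(race | surname) * P(block | race)
p_race_given_surname <- c(white = 0.70, black = 0.20, hispanic = 0.05, asian = 0.05)
p_block_given_race   <- c(white = 0.001, black = 0.010, hispanic = 0.002, asian = 0.001)

unnormalized <- p_race_given_surname * p_block_given_race
p_race_given_both <- unnormalized / sum(unnormalized)
round(p_race_given_both, 3)
# Living in a heavily Black block shifts the posterior toward "black"
# even though the surname alone pointed toward "white"
```

Aggregating these posterior probabilities over all voters in a precinct yields the precinct-level composition estimates.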

Interesting sources of spatial data

Coding

# Set working directory
# setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

# Load packages
suppressPackageStartupMessages({
  library(tidyverse)
  library(tidycensus)
  library(areal)
  library(nngeo)
  library(patchwork)
  library(sf)
  library(censusxy)  
})

# Set plot theme
plot_theme <-
  theme_void() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 18, face = "bold"),
        axis.title.x = element_text(margin = margin(t = 12)),
        axis.title.y = element_text(margin = margin(r = 12)),
        plot.title = element_text(size = 18, face = "bold", hjust = 0.5))

sf objects

sf stands for simple features. Simple features are a formal standard through which real-world objects can be represented by computers. It is implemented by most spatial data software and databases, probably including all the spatial software you’ve heard of: ArcGIS, GDAL, etc.

Because sf is a standard that underlies all of these tools, users can basically ignore filetypes or application-specific restrictions. If it’s spatial data, you can probably read it in with read_sf() / st_read().

sf vs. sp

R has two toolkits for spatial data analysis: sf and sp. sp is older and better established, and many people are more familiar with it. That said, I recommend learning sf instead of sp.

The Geocomputation with R text lists the following advantages:

  • Faster reading and writing of data
  • Enhanced plotting performance
  • sf objects can be treated as data frames in most operations
  • sf functions can be integrated using %>% into tidyverse pipe sequences
  • sf function names are relatively consistent and intuitive

Moreover, from the sf documentation, it “aims at succeeding sp in the long term”. The list of tasks for which sp outperforms sf is dwindling and will eventually be empty.

# Read in data
sf_precincts <- read_sf("data/fulton_gwinnett.gpkg")
class(sf_precincts)
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"
head(sf_precincts)
## Simple feature collection with 6 features and 16 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -12511.88 ymin: 412311.1 xmax: -4226.666 ymax: 436447.4
## Projected CRS: NAD83(2011) / Georgia East
## # A tibble: 6 x 17
##   precinct_id_2018 county abrams_prop kemp_prop metz_prop total_votes
##   <chr>            <chr>        <dbl>     <dbl>     <dbl>       <dbl>
## 1 Fulton,08P       Fulton       0.831    0.159    0.00974         924
## 2 Fulton,Ss09B     Fulton       0.459    0.526    0.0143         2445
## 3 Fulton,03A       Fulton       0.981    0.0145   0.00435         690
## 4 Fulton,07J       Fulton       0.647    0.334    0.0195         1745
## 5 Fulton,09E       Fulton       0.946    0.0473   0.00636        1416
## 6 Fulton,12A       Fulton       0.957    0.0370   0.00586        2731
## # ... with 11 more variables: whi_true_total <dbl>, bla_true_total <dbl>,
## #   his_true_total <dbl>, asi_true_total <dbl>, oth_true_total <dbl>,
## #   whi_true <dbl>, bla_true <dbl>, his_true <dbl>, asi_true <dbl>,
## #   oth_true <dbl>, geom <MULTIPOLYGON [m]>

The sf object behaves exactly like a tibble, except it has one extra column, geom, containing spatial information. Aside from messing with the geom column, you can pretty much treat the sf object like a data frame. For example, I can get the margin of victory for Stacey Abrams over Brian Kemp in large precincts using basic dplyr verbs.

margin_large_precincts <- sf_precincts %>%
    filter(total_votes > 4000) %>%
    select(precinct_id_2018, county, ends_with("prop"), total_votes) %>%
    mutate(abrams_pct_margin = (abrams_prop - kemp_prop)*total_votes)
class(margin_large_precincts)
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"
margin_large_precincts
## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -30873.14 ymin: 397015.3 xmax: 324.0863 ymax: 458847.4
## Projected CRS: NAD83(2011) / Georgia East
## # A tibble: 6 x 8
##   precinct_id_2018 county abrams_prop kemp_prop metz_prop total_votes
## * <chr>            <chr>        <dbl>     <dbl>     <dbl>       <dbl>
## 1 Fulton,Sc15      Fulton       0.971    0.0278   0.00166        4209
## 2 Fulton,07A       Fulton       0.567    0.422    0.0118         4077
## 3 Fulton,Uc02A     Fulton       0.971    0.0268   0.00233        4287
## 4 Fulton,06D       Fulton       0.802    0.186    0.0119         6046
## 5 Fulton,Rw09      Fulton       0.353    0.632    0.0143         4205
## 6 Fulton,Rw12      Fulton       0.401    0.581    0.0183         4218
## # ... with 2 more variables: geom <MULTIPOLYGON [m]>, abrams_pct_margin <dbl>

Some operations will revert the object to a plain data frame. A common case is a join: R applies the class of the first object passed to the join, which can cause the sf class to get dropped.

Below I add a FIPS code to the sf object with a join. If I put the plain data frame first, I lose the sf class; I have to do it the other way around.

countyfips <- data.frame(
    "county" = c("Fulton", "Gwinnett"),
    "FIPS" = c("13121", "13135")
)
sf_georgia_w_fips = right_join(countyfips, sf_precincts)
## Joining, by = "county"
class(sf_georgia_w_fips)
## [1] "data.frame"
sf_georgia_w_fips = right_join(sf_precincts, countyfips)
## Joining, by = "county"
class(sf_georgia_w_fips)
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"

The geom column

What’s going on in the special geometry column?

sf_precincts %>%
    select(geom) %>%
    head()
## Simple feature collection with 6 features and 0 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -12511.88 ymin: 412311.1 xmax: -4226.666 ymax: 436447.4
## Projected CRS: NAD83(2011) / Georgia East
## # A tibble: 6 x 1
##                                                                             geom
##                                                               <MULTIPOLYGON [m]>
## 1 (((-6803.698 422854.3, -6879.647 422779.3, -6904.589 422757.3, -6972.799 4226~
## 2 (((-5618.893 432298.6, -5605.268 432303.6, -5590.276 432307.6, -5574.961 4323~
## 3 (((-9733.036 419134, -9729.344 419120.6, -9722.918 419075.5, -9720.44 419053.~
## 4 (((-5155.174 425017.6, -5151.01 425008.7, -5140.385 424985.8, -5132.75 424974~
## 5 (((-10372.33 420976.9, -10575.86 420923.2, -10635.16 420915.8, -10607.59 4209~
## 6 (((-9778.072 412374.1, -9769.025 412456.7, -9764.948 412507.3, -9758.865 4125~

The geom column contains a vector of multidimensional points describing the geographic shape. There are several geometry types, the full list of which can be found in the texts above. In my experience, most social science work involves POINT, POLYGON and MULTIPOLYGON types.

Points are just points

points <- rbind(c(1,1), c(1, 2), c(2, 1), c(2, 2))
points <- st_multipoint(points)
plot(points)

Polygons are built up of linestrings

polygon_strings <- rbind(c(1,1), c(1, 2), c(2, 2), c(2, 1), c(1, 1))
polygon_list = list(polygon_strings)
plot(st_polygon(polygon_list))

Multipolygons are collections of polygons. This enables describing multiple separate polygons within the same row of an sf object. They would be the right type if, for example, you wanted to represent an island country like Japan in a single row. Multipolygons can also have holes, whereas polygons cannot.

polygon_strings_2 <- rbind(c(3,1), c(3, 2), c(4, 2), c(4, 1), c(3, 1))
multipolygon = st_polygon(list(polygon_strings, polygon_strings_2))
plot(multipolygon)

In general, R will take care of defining the geometries for you. Usually you’ll know to expect either points or polygons based on the nature of your data. For instance, traffic stops or schools will probably be stored as points, while school districts or election precincts will be polygons.

Also in general, you might find polygons and multipolygons being used interchangeably. In this dataset all the precincts are labelled as multipolygons even though most of them could be polygons. As far as I know, this is fine. Multipolygon accepts a broader range of shapes and won’t break if two precincts happen to be separated.

What exactly are the numbers? These depend on your coordinate reference system, which we discuss below.

Mapping

sf is well-integrated with ggplot. This makes map-construction in R pretty easy, at least for simple plots.

ggplot(sf_precincts) +
    geom_sf(color = "white", aes(fill = county)) + 
    plot_theme

geom_sf() behaves much like other plot types. You can set the fills equal to aesthetics easily. Let’s see where votes for Stacey Abrams are concentrated…

ggplot(sf_precincts) +
    geom_sf(color = "NA", aes(fill = abrams_prop)) + 
    plot_theme

To visualize the borders between the counties, we can take advantage of the fact that summarize dissolves geographic areas.

sf_counties <- sf_precincts %>%
    group_by(county) %>%
    summarize() %>%
    nngeo::st_remove_holes()

head(sf_counties)
## Simple feature collection with 2 features and 1 field
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -49386.5 ymin: 391000.8 xmax: 49071.09 ymax: 466262
## Projected CRS: NAD83(2011) / Georgia East
##     county                       geometry
## 1   Fulton MULTIPOLYGON (((-36385.64 3...
## 2 Gwinnett POLYGON ((24143.27 420055.6...
ggplot() +
    geom_sf(data = sf_precincts, color = "NA", aes(fill = abrams_prop)) +
    geom_sf(data = sf_counties, color = 'grey', size = 1.5, fill = "NA") + 
    plot_theme

When working with data in the US, it’s helpful to know that the tidycensus package can return sf objects when you query the Census API. For instance, what if we wanted to place these counties within the state of Georgia?

# Note that to run this yourself you will need to set up an API key. See here:
# https://walker-data.com/tidycensus/articles/basic-usage.html
sf_georgia <- get_acs(
    geography = "state",
    year = 2018,
    variables = "B00001_001",
    state = "Georgia",
    geometry = TRUE
)
## Getting data from the 2014-2018 5-year ACS
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
sf_georgia
## Simple feature collection with 1 feature and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -85.60516 ymin: 30.35785 xmax: -80.83973 ymax: 35.00066
## Geodetic CRS:  NAD83
##   GEOID    NAME   variable estimate                       geometry
## 1    13 Georgia B00001_001   651000 MULTIPOLYGON (((-81.27939 3...
ggplot() +
    geom_sf(data = sf_precincts, color = "NA", fill = "black") + 
    geom_sf(data = sf_georgia, color = "black", fill = "NA") +
    plot_theme

Coordinate Reference Systems (CRS)

TLDR; CRS defines the units for your points.

The two-dimensional planes upon which the above maps are plotted make the units of maps seem straightforward, when in reality they are not. We have known for some time that the earth is not a flat two-dimensional plane. It is round, more of an ellipsoid than a sphere, and covered with peaks and valleys.

CRS’s are different measurement systems that have different strengths and weaknesses when it comes to handling the complexity of the Earth’s shape. All of them have some risk of being inaccurate, and this might matter if you are doing highly precise spatial operations. There are lots of different CRS’s you can read about here.

A CRS can be ‘geographic’ or ‘projected’. A geographic CRS retains the true shape of the earth, namely that it is round. A projected CRS represents the earth’s surface in two dimensions, which makes plotting feasible.
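A quick base-R illustration of why this matters: in a geographic CRS, a naive straight-line distance between two lon/lat points comes out in “degrees”, which is not a physical unit. A great-circle (haversine) calculation, sketched here as a rough stand-in for what spatial software does with geographic coordinates, returns meters. The coordinates below are approximate.

```r
# Distance between (approximately) Atlanta and Athens, GA
atlanta <- c(lon = -84.39, lat = 33.75)
athens  <- c(lon = -83.38, lat = 33.96)

# Naive Euclidean distance on raw lon/lat: a unitless number of "degrees"
sqrt(sum((atlanta - athens)^2))

# Haversine great-circle distance in meters (r = mean Earth radius)
haversine <- function(p1, p2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (p2["lat"] - p1["lat"]) * to_rad
  dlon <- (p2["lon"] - p1["lon"]) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(p1["lat"] * to_rad) * cos(p2["lat"] * to_rad) * sin(dlon / 2)^2
  unname(2 * r * asin(sqrt(a)))
}
haversine(atlanta, athens)  # roughly 96 km, expressed in meters
```

A projected CRS like the Georgia East one used below sidesteps this by storing coordinates directly in meters.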

sf objects typically come with a preassigned CRS:

st_crs(sf_precincts)
## Coordinate Reference System:
##   User input: NAD83(2011) / Georgia East 
##   wkt:
## PROJCRS["NAD83(2011) / Georgia East",
##     BASEGEOGCRS["NAD83(2011)",
##         DATUM["NAD83 (National Spatial Reference System 2011)",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",6318]],
##     CONVERSION["SPCS83 Georgia East zone (meters)",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",30,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",-82.1666666666667,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",0.9999,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",200000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["easting (X)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["northing (Y)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["Engineering survey, topographic mapping."],
##         AREA["United States (USA) - Georgia - counties of Appling; Atkinson; Bacon; Baldwin; Brantley; Bryan; Bulloch; Burke; Camden; Candler; Charlton; Chatham; Clinch; Coffee; Columbia; Dodge; Echols; Effingham; Elbert; Emanuel; Evans; Franklin; Glascock; Glynn; Greene; Hancock; Hart; Jeff Davis; Jefferson; Jenkins; Johnson; Lanier; Laurens; Liberty; Lincoln; Long; Madison; McDuffie; McIntosh; Montgomery; Oglethorpe; Pierce; Richmond; Screven; Stephens; Taliaferro; Tattnall; Telfair; Toombs; Treutlen; Ware; Warren; Washington; Wayne; Wheeler; Wilkes; Wilkinson."],
##         BBOX[30.36,-83.47,34.68,-80.77]],
##     ID["EPSG",6444]]

This is a CRS I selected. I don’t understand most of this description. I do know this is part of a series of NAD83 projected CRS’s. NAD83 is one of the two most popular CRS’s; the other is WGS84. WGS84 is what’s called a geocentric datum, meaning its center is the center of the Earth and it is agnostic to abnormalities on the surface of the Earth. In contrast, NAD83 is a local datum, meaning it is optimized for use in a particular region. I picked one that is optimized for use in Georgia.

This CRS was set by running st_transform(sf_precincts, 6444). 6444 is the epsg code for the CRS. epsg codes are the easy way to specify a CRS in R. The other way is to write a proj4string, which involves writing out the specifics of the CRS in detail. I’ve never done this before.

The key takeaway

CRS and projections are a very complicated topic and there are a lot of options available. I think it’s easy to get bogged down in this. Probably the best approach is to find a CRS/projection combination that works well for your data and be consistent about applying it. As long as you do this, the remainder of your workflow should go smoothly. You can see a list of the available epsg codes using rgdal::make_EPSG().

crs_data <- rgdal::make_EPSG()
head(crs_data)
##   code   note                                                   prj4 prj_method
## 1 3819 HD1909         +proj=longlat +ellps=bessel +no_defs +type=crs     (null)
## 2 3821  TWD67        +proj=longlat +ellps=aust_SA +no_defs +type=crs     (null)
## 3 3822  TWD97 +proj=geocent +ellps=GRS80 +units=m +no_defs +type=crs     (null)
## 4 3823  TWD97          +proj=longlat +ellps=GRS80 +no_defs +type=crs     (null)
## 5 3824  TWD97          +proj=longlat +ellps=GRS80 +no_defs +type=crs     (null)
## 6 3887   IGRS +proj=geocent +ellps=GRS80 +units=m +no_defs +type=crs     (null)
crs_data %>% 
    filter(code == 6444)
##   code                       note
## 1 6444 NAD83(2011) / Georgia East
##                                                                                                                   prj4
## 1 +proj=tmerc +lat_0=30 +lon_0=-82.1666666666667 +k=0.9999 +x_0=200000 +y_0=0 +ellps=GRS80 +units=m +no_defs +type=crs
##            prj_method
## 1 Transverse Mercator

Geocoding

Lots of projects involving spatial analyses begin from address data. Addresses must be converted into points via geocoding. There are many geocoding services available and the options are always changing. For the most part, this is a commercial market. Some common alternatives include:

The Census API is generally very easy to use. I won’t do a full demo because I don’t want to make a ton of addresses from a real dataset public. Here’s a quick demo with one address line.

geo <- cxy_single(street = "171 East State Street", city = "Ithaca", state = "NY", zip = 14850)
geo
##   tigerLine.side tigerLine.tigerLineId coordinates.x coordinates.y
## 1              R              18857717     -76.49751      42.43958
##   addressComponents.zip addressComponents.streetName addressComponents.preType
## 1                 14850                        STATE                          
##   addressComponents.city addressComponents.preDirection
## 1                 ITHACA                              E
##   addressComponents.suffixDirection addressComponents.fromAddress
## 1                                                             101
##   addressComponents.state addressComponents.suffixType
## 1                      NY                           ST
##   addressComponents.toAddress addressComponents.suffixQualifier
## 1                         199                                  
##   addressComponents.preQualifier                    matchedAddress
## 1                                171 E STATE ST, ITHACA, NY, 14850

Then to convert a geocoded address to an sf object:

geo_sf <- st_as_sf(geo, coords = c('coordinates.x', 'coordinates.y'))
head(geo_sf)
## Simple feature collection with 1 feature and 15 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -76.49751 ymin: 42.43958 xmax: -76.49751 ymax: 42.43958
## CRS:           NA
##   tigerLine.side tigerLine.tigerLineId addressComponents.zip
## 1              R              18857717                 14850
##   addressComponents.streetName addressComponents.preType addressComponents.city
## 1                        STATE                                           ITHACA
##   addressComponents.preDirection addressComponents.suffixDirection
## 1                              E                                  
##   addressComponents.fromAddress addressComponents.state
## 1                           101                      NY
##   addressComponents.suffixType addressComponents.toAddress
## 1                           ST                         199
##   addressComponents.suffixQualifier addressComponents.preQualifier
## 1                                                                 
##                      matchedAddress                   geometry
## 1 171 E STATE ST, ITHACA, NY, 14850 POINT (-76.49751 42.43958)

Spatial operations

Next, we’ll work through examples of common spatial operations. A bunch of them require some point data, so we’ll use this dataset with point-located sports venues throughout Georgia.

sf_sports <- read_sf("data/sports_venues.gpkg")
head(sf_sports)
## Simple feature collection with 6 features and 27 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -9394569 ymin: 3650717 xmax: -9062026 ymax: 4022064
## Projected CRS: WGS 84 / Pseudo-Mercator
## # A tibble: 6 x 28
##   OBJECTID VENUEID NAME   ADDRESS CITY  STATE ZIP   ZIP4  TELEPHONE TYPE  STATUS
##      <int>   <dbl> <chr>  <chr>   <chr> <chr> <chr> <chr> <chr>     <chr> <chr> 
## 1       72      99 BOBBY~ 155 NO~ ATLA~ GA    30332 NOT ~ (404) 89~ SING~ OPEN  
## 2       73     100 SANFO~ SANFOR~ ATHE~ GA    30605 NOT ~ (706) 54~ SING~ OPEN  
## 3      204     240 GEORG~ 755 HA~ ATLA~ GA    30315 NOT ~ (404) 41~ SING~ OPEN  
## 4      239     276 EAST ~ 2575 A~ ATLA~ GA    30317 NOT ~ (404) 37~ SING~ OPEN  
## 5      240     277 AUGUS~ 807 EI~ AUGU~ GA    30904 NOT ~ (706) 73~ SING~ OPEN  
## 6      241     278 SEA I~ 100 RE~ ST S~ GA    31522 NOT ~ (800) 73~ SING~ OPEN  
## # ... with 17 more variables: POPULATION <dbl>, COUNTY <chr>, COUNTYFIPS <chr>,
## #   COUNTRY <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, NAICS_CODE <chr>,
## #   NAICS_DESC <chr>, SOURCE <chr>, SOURCEDATE <date>, VAL_METHOD <chr>,
## #   VAL_DATE <date>, NAME2 <chr>, NAME3 <chr>, FORMERNM <chr>, YEARCHNG <chr>,
## #   geom <POINT [m]>

If we want to use this sf object together with our election precincts one, we first need to line up their CRS.

sf_sports <- st_transform(sf_sports, st_crs(sf_precincts))
st_crs(sf_sports) == st_crs(sf_precincts)
## [1] TRUE

Where are these sports stadiums?

sf_georgia <- st_transform(sf_georgia, st_crs(sf_precincts))

ggplot() +
    geom_sf(data = sf_precincts, color = "NA", fill = "black") + 
    geom_sf(data = sf_georgia, color = "black", fill = "NA") +
    geom_sf(data = sf_sports, color = "red", size = 2) +
    plot_theme

Filtering

We can get rid of the ones not inside the counties we care about using a filter.

sf_sports_gf <- st_filter(sf_sports, sf_precincts)
ggplot() +
    geom_sf(data = sf_precincts, color = "NA", fill = "black") + 
    geom_sf(data = sf_georgia, color = "black", fill = "NA") +
    geom_sf(data = sf_sports, color = "red", size = 2) +
    geom_sf(data = sf_sports_gf, color = "green", size = 4) +
    plot_theme

Joining points to polygons

At this point it’s still going to be hard to do any analysis because the information we’re interested in is split across two objects. The sports stadiums are in sf_sports, and the precincts are in sf_precincts. If we want to analyse the effect of having a sports stadium on a precinct’s election results (we don’t, but pretend we do), we need a variable in the sf_precincts object that says whether or not a precinct contains a sports stadium.

ggplot() +
    geom_sf(data = sf_precincts, color = "grey", fill = "white") + 
    geom_sf(data = sf_sports_gf, color = "red", size = 1) +
    plot_theme

To get them together, we can do a spatial join.

sf_precincts$n_stadiums <- lengths(st_intersects(sf_precincts, sf_sports_gf))
ggplot() +
    geom_sf(data = sf_precincts, color = "grey", aes(fill = as.factor(n_stadiums))) + 
    geom_sf(data = sf_sports_gf, color = "red", size = 1) +
    scale_fill_brewer(type = "seq") +
    plot_theme

Distances

Now let’s try computing the distance from each precinct to the nearest stadium. The first problem we face is that we can theoretically choose any point within a polygon from which to compute a distance.

Centroids

Typically people use centroids for this. You can think of centroids as points at the middle of the polygon.

In actuality it’s a bit more complicated than that. The most straightforward centroid location is computed by drawing a box around the maximum and minimum heights and widths of a polygon, then taking the direct center of the box. However, you can imagine situations where this method ends up picking a centroid that isn’t even in the intended polygon, like if the polygon is L-shaped. There are various other formulations that modify the basic computation to avoid this problem, and which one you choose may affect results in some instances. In most datasets, the choice of centroid computation probably matters for some small (but maybe important) subset of the shapes.
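The L-shaped case can be shown in base R (no sf needed). For the particular L below, both the bounding-box center and the area-weighted centroid, computed with the standard shoelace formula, land in the notch, outside the polygon itself.

```r
# An L-shaped polygon, vertices listed counter-clockwise
xs <- c(0, 3, 3, 1, 1, 0)
ys <- c(0, 0, 1, 1, 3, 3)

# Bounding-box center: midpoint of the x and y ranges
bbox_center <- c(mean(range(xs)), mean(range(ys)))

# Area-weighted centroid via the shoelace formula
i <- seq_along(xs)
j <- c(i[-1], 1)  # index of the next vertex, wrapping around
cross <- xs[i] * ys[j] - xs[j] * ys[i]
area  <- sum(cross) / 2
centroid <- c(sum((xs[i] + xs[j]) * cross),
              sum((ys[i] + ys[j]) * cross)) / (6 * area)

bbox_center  # (1.5, 1.5): in the notch, outside the L
centroid     # (1.1, 1.1): also in the notch for this shape
```

Both candidate points fall outside the polygon here, which is exactly the situation st_point_on_surface() (introduced below) is designed to avoid.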

When computing centroids with sf

  • Centroids are computed assuming data lie on a 2D plane. This means you should use a projected CRS when you’re computing these.
  • st_centroid() computes the centroid as described above.
  • st_point_on_surface() computes a variation of centroids that ensures an area’s centroid is within its boundaries.

sf_precinct_centroids <- st_centroid(sf_precincts)
## Warning in st_centroid.sf(sf_precincts): st_centroid assumes attributes are
## constant over geometries of x
sf_precinct_centsonsurface <- st_point_on_surface(sf_precincts)
## Warning in st_point_on_surface.sf(sf_precincts): st_point_on_surface assumes
## attributes are constant over geometries of x
ggplot() +
    geom_sf(data = sf_precincts, color = "grey", fill = "white") + 
    geom_sf(data = sf_precinct_centroids, color = "red", size = 1) +
    geom_sf(data = sf_precinct_centsonsurface, color = "blue", size = 1) +
    plot_theme

Getting distance to the nearest stadium

st_nearest_feature() takes in two sf objects and returns a vector of row indices from the second object that are closest to each shape in the first.

nearest <- st_nearest_feature(sf_precinct_centroids, sf_sports)
nearest[1:10]
##  [1]  9 19 18  9  9  3  3 18  9 17

So the first precinct centroid is closest to the stadium in the ninth row of the sports object, and so on. We can do some old-fashioned indexing to get the distances between these points, then add the result back to sf_precincts.

dist_to_nearest <- st_distance(sf_precinct_centroids, sf_sports[nearest,], by_element = TRUE)
sf_precincts$dist_to_nearest <- as.numeric(dist_to_nearest)

ggplot() +
    geom_sf(data = sf_precincts, color = "NA", aes(fill = dist_to_nearest/1000)) +
    geom_sf(data = sf_sports_gf, color = "red", size = 1) +
    scale_fill_gradient2(low = "black", mid = "grey10", high = "white") +
    plot_theme
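To demystify the nearest-feature-plus-indexing pattern, here is the same logic in base R on made-up points, using plain Euclidean distances; the object names are hypothetical and this is a sketch of the idea, not of sf's internals.

```r
# Toy stand-ins: 5 "centroids" and 3 "stadiums" as coordinate matrices
set.seed(2)
A <- cbind(x = runif(5), y = runif(5))
B <- cbind(x = runif(3), y = runif(3))

# Pairwise Euclidean distance matrix: rows index A, columns index B
dmat <- sqrt(outer(A[, "x"], B[, "x"], "-")^2 +
             outer(A[, "y"], B[, "y"], "-")^2)

# Analogous to st_nearest_feature(): index of the closest B for each A
nearest_idx <- apply(dmat, 1, which.min)

# Analogous to st_distance(..., by_element = TRUE) after reordering B:
# pick the (row, nearest column) entry for each row
dist_to_nearest <- dmat[cbind(seq_len(nrow(A)), nearest_idx)]
```

The matrix indexing in the last line is the same "old-fashioned indexing" trick used with the real data above.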

Interpolation

Sometimes you want a measure in one geographic unit, but you only have it in another. This is the situation in the Decter-Frain et al. paper above. We can use areal-weighted interpolation to crudely transfer the measure from one spatial unit to another.

Interpolation is crude because it relies on the assumption that whatever characteristic of an area you’re interpolating is spread evenly across the initial geographic surface. This is a very strong assumption that almost never holds. In this case, the populations of different race groups are most certainly not spread evenly across Census areas. Nonetheless, this tool can be useful when there are no good alternatives.
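The arithmetic behind areal weighting for an extensive variable (a count) is simple enough to show with toy numbers: each source unit contributes its value in proportion to the share of its area that overlaps the target unit. The figures below are invented for illustration.

```r
# Two toy block groups overlapping one precinct
source_pop  <- c(bg1 = 400, bg2 = 1000)  # block-group populations (counts)
source_area <- c(bg1 = 2.0, bg2 = 4.0)   # block-group areas (sq km)
overlap     <- c(bg1 = 1.0, bg2 = 1.0)   # area of each inside the precinct

# Share of each source unit that falls inside the target unit
weights <- overlap / source_area

# Interpolated precinct population: 400 * 0.5 + 1000 * 0.25 = 450
precinct_pop <- sum(source_pop * weights)
precinct_pop
```

The result is correct only if population is spread evenly within each block group, which is the strong assumption discussed above.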

Here we’re going to grab data on the voting-aged populations of different race groups at the block-group level and interpolate this over to the election precincts. We could use Census blocks, but it would be slower.

georgia_counties <- c("121", "135")

sf_vap <- purrr::map_dfr(
  georgia_counties,
  ~ tidycensus::get_decennial(
    geography = "block group",
    variables = c(
      "P011002", # total hispanic, 18+
      "P011005", # total white, 18+
      "P011006", # total black, 18+
      "P011008"  # total asian, 18+
    ),
    year = 2010,
    summary_var = "P011001", # total population, 18+,
    state = "GA",
    county = .,
    geometry = TRUE,
    cache = TRUE,
    output = "wide"
  )
)
## Getting data from the 2010 decennial Census
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## Using Census Summary File 1
## Getting data from the 2010 decennial Census
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## Using Census Summary File 1
sf_vap <- sf::st_transform(sf_vap, sf::st_crs(sf_precincts))
head(sf_vap)
## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -8255.715 ymin: 421312 xmax: -2039.175 ymax: 424103.3
## Projected CRS: NAD83(2011) / Georgia East
## # A tibble: 6 x 8
##   GEOID        NAME                P011002 P011005 P011006 P011008 summary_value
##   <chr>        <chr>                 <dbl>   <dbl>   <dbl>   <dbl>         <dbl>
## 1 131210001001 Block Group 1, Cen~      13     578       8      11           617
## 2 131210002001 Block Group 1, Cen~      43     779     100      30           969
## 3 131210002005 Block Group 5, Cen~      30     653      25      23           745
## 4 131210006001 Block Group 1, Cen~     243    2289    1170    1113          4970
## 5 131210010011 Block Group 1, Cen~      65     519      85     741          1471
## 6 131210011001 Block Group 1, Cen~      57     784     114      72          1052
## # ... with 1 more variable: geometry <MULTIPOLYGON [m]>

We can see that the two maps have different boundary lines

ggplot() +
  geom_sf(data = sf_precincts, fill = NA, color = 'red', size = .2) +
  geom_sf(data = sf_vap, fill = NA, color = 'blue', size = .2) +
  plot_theme

We can interpolate from one dataset to the other. There is a built-in st_interpolate_aw() function in sf, but I have found the version from the areal package tends to perform better, at least for this problem.

sf_precincts <- areal::aw_interpolate(
  sf_precincts,
  tid = precinct_id_2018,
  source = sf_vap,
  sid = GEOID,
  output = "sf",
  weight = "sum",
  extensive = c(
    "P011002", "P011005",
    "P011006", "P011008", "summary_value"
  )
)

Here we do some processing to construct measures of racial composition.

sf_precincts <- sf_precincts %>%
  dplyr::rename(
    "whi_2010_vap_total" = "P011005",
    "bla_2010_vap_total" = "P011006",
    "his_2010_vap_total" = "P011002",
    "asi_2010_vap_total" = "P011008",
    "vap_total" = "summary_value"
  )

sf_precincts$oth_2010_vap_total <- sf_precincts$vap_total -
  sf_precincts$whi_2010_vap_total -
  sf_precincts$bla_2010_vap_total -
  sf_precincts$his_2010_vap_total -
  sf_precincts$asi_2010_vap_total

sf_precincts <- sf_precincts %>%
  dplyr::mutate(oth_2010_vap_total = ifelse(
    oth_2010_vap_total < 0,
    0,
    oth_2010_vap_total
  )) %>%
  dplyr::mutate(
    "whi_2010_vap_prop" = ifelse(vap_total == 0, 0, whi_2010_vap_total / vap_total),
    "bla_2010_vap_prop" = ifelse(vap_total == 0, 0, bla_2010_vap_total / vap_total),
    "his_2010_vap_prop" = ifelse(vap_total == 0, 0, his_2010_vap_total / vap_total),
    "asi_2010_vap_prop" = ifelse(vap_total == 0, 0, asi_2010_vap_total / vap_total),
    "oth_2010_vap_prop" = ifelse(vap_total == 0, 0, oth_2010_vap_total / vap_total)
  )

Now we can compare the composition estimates from interpolation to the actual self-reported race estimates

p1 <- ggplot(sf_precincts) +
    geom_sf(aes(fill = whi_true)) +
    scale_fill_gradient2(low = "black", mid = "grey10", high = "white") +
    plot_theme +
    ggtitle("Self-Reported") +
    theme(legend.position = "None",
          axis.text = element_blank())

p2 <- ggplot(sf_precincts) +
    geom_sf(aes(fill = whi_2010_vap_prop)) +
    scale_fill_gradient2(low = "black", mid = "grey10", high = "white") +
    plot_theme +
    ggtitle("Interpolated") +
    theme(legend.position = "None",
          axis.text = element_blank())

p1 + p2